SharePoint Data Sources

You can add a SharePoint data source to the KB, enabling the extraction of paragraphs from Excel, Word, and PDF documents within SharePoint site libraries.

NOTE: DRUID 9.1 and higher supports video file extraction. For requirements, limitations, and how the extraction works, see Extracting Data from Video Files.

General Prerequisites

To connect SharePoint to DRUID, your Microsoft Azure subscription administrator needs to navigate to MS Azure > SharePoint File Discovery and provide the following details:

Tenant ID
Client ID
Client Secret

Depending on the SharePoint content you want to grant DRUID access to, specific configurations are required.

Granting DRUID Access to All SharePoint Content

To allow DRUID access to all content in your SharePoint, your Azure administrator must:

Navigate to MS Azure > SharePoint File Discovery.
Grant the Sites.Read.All API permission.

Granting DRUID Access to a Specific SharePoint Site

For site-specific access, your Azure administrator needs to create two app registrations in the Microsoft Azure Portal:

Main app: Used to assign permissions (e.g., ManagerApp).
Client app: Used to access the SharePoint site content (e.g., ClientApp).

To grant access to a specific SharePoint site:

Get the main app ID and write it down. You’ll need this later.

Get the main app secret and write it down. You’ll need this later.

Grant API Permissions to the main app. Assign the Sites.FullControl.All permission in Microsoft Graph to the main app.

Get the client app ID and write it down. You’ll need this later.

Get the client app secret and write it down. You’ll need this later.

Grant API Permissions to the Client App. Assign the Sites.Selected permission in Microsoft Graph to the client app.

Configure SharePoint permissions. To simplify the process, you can use this Postman collection to make the required API calls:

Get the access token. Make a POST request to the following API using the main app registration (ManagerApp) to obtain an access token:

https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token

Get the site ID. Make a GET request to the following URL, using the access token obtained above as a Bearer token for authentication:

https://graph.microsoft.com/v1.0/sites/companyname.sharepoint.com:/sites/sitename

The response contains an id in the form {hostname},{siteCollectionId},{siteId} (e.g., "id":"companyname.sharepoint.com,c5acaa50-8617-2c1a-a68c-6c92f7649d29,6a3638b2-5683-4221-82e9-6e701eb1c318"). Take only the siteCollectionId (e.g., c5acaa50-8617-2c1a-a68c-6c92f7649d29) and use it as the {siteId} value in the following steps.

Assign site permissions. Make a POST request to the following API using the access token obtained earlier:

https://graph.microsoft.com/v1.0/sites/{siteId}/permissions

Replace {siteId} with the siteCollectionId value you copied when you retrieved the site ID.
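The three API calls above can be sketched in Python as follows. This is a minimal illustration, not part of the Postman collection: the function names are hypothetical, the token form fields assume the standard Microsoft identity platform client-credentials flow, and the permission body (including the "read" role) is an assumption based on the Microsoft Graph site permissions request, since this guide does not specify it. The functions only build the requests; they do not send them.

```python
import json
from urllib.parse import urlencode

GRAPH = "https://graph.microsoft.com/v1.0"


def token_request(tenant_id, client_id, client_secret):
    """Build the URL and form body for the token request (main app credentials)."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",   # assumption: client-credentials flow
        "client_id": client_id,               # main app (ManagerApp) ID
        "client_secret": client_secret,       # main app secret
        "scope": "https://graph.microsoft.com/.default",
    }
    return url, urlencode(form).encode("ascii")


def site_lookup_url(hostname, site_name):
    """Build the GET URL that returns the composite site id."""
    return f"{GRAPH}/sites/{hostname}:/sites/{site_name}"


def site_collection_id(composite_id):
    """Return the middle part of '{hostname},{siteCollectionId},{siteId}'."""
    parts = composite_id.split(",")
    if len(parts) != 3:
        raise ValueError("expected '{hostname},{siteCollectionId},{siteId}'")
    return parts[1]


def permission_request(site_id, client_app_id, display_name, role="read"):
    """Build the POST URL and JSON body that grants the client app access.

    The body shape and the default role are assumptions, not taken from this guide.
    """
    url = f"{GRAPH}/sites/{site_id}/permissions"
    body = {
        "roles": [role],
        "grantedToIdentities": [
            {"application": {"id": client_app_id, "displayName": display_name}}
        ],
    }
    return url, json.dumps(body)
```

Both the site lookup and the permission request would carry the token in an Authorization: Bearer header, as described in the steps above.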

Adding SharePoint data sources and extracting data

This section describes how to add a SharePoint data source and extract paragraphs from documents stored in SharePoint libraries.

NOTE: You can authenticate with SharePoint in two ways: using a SharePoint client secret or a DRUID-generated certificate. If you want to authenticate with SharePoint by using a DRUID-generated certificate, you should follow the procedure described in section Authenticate with SharePoint by using a DRUID-generated certificate before adding the data source; otherwise, verifying the data source credentials at creation time will fail.

Step 1. Create the data source

Click the Add New button. The Add New Data Source page opens.
In the Name field, provide a name for the data source. It will help you easily identify and search for a data source.
From the Language drop-down, select the language of the data you upload. It must be one of the bot languages.
From the Type drop-down, select SharePoint. After you select the type, the page displays the additional fields required for adding a SharePoint data source.

In the Content Location (URL) field, provide the URL of the SharePoint site from which DRUID will extract data.

To find the SharePoint Content Location (URL), follow these steps:

Right-click on the desired site library/folder/document in Microsoft SharePoint and select Details.

In the right-side panel, click More details.

Scroll down the right-side panel until you locate the Path field, and then click the copy icon next to it. This is the Content Location (URL).

In the Source type field, select either SharePoint Online or SharePoint 2019.

NOTE: SharePoint on premises is available in DRUID 7.13 onwards. In earlier versions, the Source Type field won't be available because you can only connect to SharePoint Online data sources.

Enter the Tenant Id and Client Id.
For SharePoint 2019, you can crawl sites and subsites from a specific library by entering the library path in the Document Library Path field. This field is available in DRUID 8.3 and higher.
You can authenticate with SharePoint Online in two ways: using a SharePoint client secret or a DRUID-generated certificate.

Client Secret: Enter the client secret provided by your Microsoft Azure subscription administrator in the Client Secret field.
DRUID-Signed Certificate: Select Use Certificate, then click the Create button next to the Certificate field. In the Create new certificate pop-up, enter a certificate name (you will use it to identify the certificate in the Druid Portal), select the certificate expiry date from the Ending at field, and click the Create button.

The newly created certificate is automatically selected in the Certificate field.

NOTE: Authenticating with SharePoint using a DRUID-generated certificate is available in DRUID 7.15 and higher.

HINT: If you create the certificate during data source creation, testing the data source credentials immediately will fail because the certificate needs to be imported into Azure first. We recommend following the procedure described in section Authenticate with SharePoint by using a DRUID-generated certificate.

Optionally, set the Min score threshold and the Target match score for the data source. If not set, the thresholds from the Knowledge Base will apply.
To verify the SharePoint credentials, click the Test button. If the test fails, review the SharePoint credentials to ensure they are correct. You can also verify the SharePoint credentials later by going to the Details tab of the SharePoint data source and clicking the Test button at the bottom of the page.
Click Create. The SharePoint data source appears on the Knowledge base page.
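The threshold fallback described above (data source values win; otherwise the Knowledge Base values apply) can be sketched as follows. The class, the function name, and the numeric values are hypothetical and only illustrate the rule; they are not a DRUID API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    # None means "not set at this level"
    min_score: Optional[float] = None
    target_match_score: Optional[float] = None


def effective_thresholds(data_source: Thresholds, knowledge_base: Thresholds) -> Thresholds:
    """Use the data source value when set, else fall back to the Knowledge Base value."""
    def pick(ds_value, kb_value):
        return ds_value if ds_value is not None else kb_value

    return Thresholds(
        min_score=pick(data_source.min_score, knowledge_base.min_score),
        target_match_score=pick(
            data_source.target_match_score, knowledge_base.target_match_score
        ),
    )
```

For example, a data source that sets only the Min score threshold keeps that value and takes the Target match score from the Knowledge Base.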

Step 2. Crawl the SharePoint data source

On the Knowledge base page, click the SharePoint data source. By default, the data source page opens on the Extracted paragraphs tab.

Click the Start crawling button at the top left corner of the data source. The Start Crawling Parameters page appears.

Define the crawling policy by setting the parameters described in the table below.

Parameter	Description
URL	Automatically populated with Content Location (URL) you specified when adding the data source.
Depth	The number of directory levels the crawler will explore from the URL. NOTE: To improve crawling efficiency, crawl each node individually instead of the entire root, especially if the storage has a deep structure. Set the depth to '0' to achieve this.
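The effect of the Depth parameter can be illustrated with a depth-limited traversal over a folder tree. This is only a sketch of the idea: the function name and the in-memory tree are hypothetical, and it assumes that depth counts the levels below the starting URL, so a depth of 0 visits only the starting node. DRUID's exact counting may differ.

```python
from collections import deque


def crawl(tree, root, depth):
    """Return the paths visited from `root`, descending at most `depth` levels.

    `tree` maps a folder path to the list of its direct children.
    Assumption: depth 0 means the starting node only.
    """
    visited = []
    queue = deque([(root, 0)])
    while queue:
        path, level = queue.popleft()
        visited.append(path)
        if level < depth:
            for child in tree.get(path, []):
                queue.append((child, level + 1))
    return visited


# Hypothetical library structure used only for illustration
tree = {
    "/Docs": ["/Docs/HR", "/Docs/IT"],
    "/Docs/HR": ["/Docs/HR/Policies"],
}
shallow = crawl(tree, "/Docs", 0)   # starting node only
deeper = crawl(tree, "/Docs", 2)    # starting node plus two levels below it
```

Crawling node by node with a small depth keeps each run short, which is the reason the table recommends it for deep structures.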

After you define the crawling policy, click Start.

HINT: Based on the crawling policy set, it might take up to a few minutes for the crawling to complete. You might want to refresh from time to time to see when the action has completed.

As the crawler visits the link provided in the Content Location (URL) field, it will identify all the hyperlinks in the retrieved web pages and will add them to the list of URLs to visit.

NOTE: Starting with DRUID 8.1, for SharePoint Online data sources, the crawler also identifies subsites and displays them as folders in the Data Source Tree Explorer.

To crawl a specific node, click the dots next to the desired node in the file repository explorer and select Crawl Path.

Step 3. Extract the text articles

To ensure that only relevant content is captured and added to the Knowledge Base, all nodes are excluded from scraping by default. You can extract paragraphs from the entire SharePoint library (the data source root) or from a specific library element (node / leaf).

To extract data from the SharePoint library, on the Extracted Paragraphs tab of your SharePoint data source, click the Extract button at the top left corner of the page.

You can select the pages DRUID extracts information from during the extraction process. To include specific pages in the scraping, click the dots next to the desired file explorer element and select Include. You can later exclude certain pages from scraping by clicking the dots next to the desired file explorer element and selecting Exclude, then click . The pages excluded from scraping appear on the Details tab, in the Exclude from scrapping area.

NOTE: The extraction might take a few seconds, depending on the number of links included in the scraping.

To extract data only from specific tree elements (node/folder, leaf/file), select them and from the bulk action icon, select Extract selected.

If you want to extract information only from a specific tree element, click the dots next to the tree element and select Extract.

NOTE: Starting with DRUID 7.15, the platform extracts data from Excel files with table headers that include spaces (e.g., "Question " and "Answer "). The platform automatically trims these spaces to ensure accurate data extraction. Additionally, it extracts data from Excel documents with multiple sheets, capturing the sheet name in the "sheetName" property for each extracted article.
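The behavior described in this note can be sketched as follows. This is an illustration under stated assumptions, not DRUID's extraction code: the function name is hypothetical, and the workbook is modeled as a plain dictionary (sheet name mapped to rows, with the first row as the header) instead of a real Excel file.

```python
def extract_articles(workbook):
    """Turn each data row into an article, trimming header spaces.

    `workbook` maps a sheet name to a list of rows; the first row is the header.
    Each article records the sheet it came from in "sheetName".
    """
    articles = []
    for sheet_name, rows in workbook.items():
        if not rows:
            continue  # skip empty sheets
        headers = [str(header).strip() for header in rows[0]]
        for row in rows[1:]:
            article = dict(zip(headers, row))
            article["sheetName"] = sheet_name
            articles.append(article)
    return articles


# Hypothetical two-sheet workbook; note the trailing spaces in the first header row
workbook = {
    "FAQ": [["Question ", "Answer "], ["What is X?", "X is a product."]],
    "Billing": [["Question", "Answer"], ["How do I pay?", "By card."]],
}
articles = extract_articles(workbook)
```

With the headers trimmed, "Question " and "Question" map to the same column name, so rows from both sheets end up with identical keys.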

When the extraction completes, the extracted paragraphs display under Extracted paragraphs > Content tab.

Step 4. Train the data source

To ensure the KB Engine searches through the data source articles, it's crucial to train your data source. Click the Train button at the top-left corner of the data source or select Train data source from the actions menu. Alternatively, you can Train all data sources.

HINT: If you've updated the Trainable Elements in a data source's Advanced Settings, the data source status doesn't change. You need to Retrain the data source to apply those changes. You also have the option to Retrain all data sources directly from the data source. The Retrain feature is available in DRUID 8.15 and higher.

Testing the data source performance

Testing the performance of a data source is important because it ensures that the extracted articles are relevant. This process helps identify and rectify any issues, improving the overall quality and effectiveness of your bot's responses. By validating the data source performance, you can enhance user satisfaction.

To test the performance of the data source, on the Extracted Paragraphs page, in the User Says area, enter a question and select the language. All matched articles will be displayed along with their scores.

You can improve the performance of the data source by reviewing and editing the articles based on your needs.

Editing paragraphs

To keep your Knowledge Base at high quality, we recommend that you review the extracted articles and take the appropriate actions to improve them: open the URL from which the crawler extracted the paragraph and compare the content, then edit or delete the paragraph. Refine your paragraphs by transforming unstructured data into a question-and-answer format.

To edit a paragraph, click the dots next to the paragraph and click Edit. Update the paragraph (user intent and answer) and click the Save icon at the top right corner of the page.

IMPORTANT! After making updates to your paragraphs, it's crucial to retrain your data source. This ensures the KB Engine recognizes these updates and provides accurate responses to user queries. Click the Train button at the top-left corner of the data source.

Fine-tuning Predictions

You can configure Advanced Settings at both the data source and node/leaf levels to achieve more precise predictions. This approach offers granular control, allowing you to adjust the extractors and trainable elements, resulting in better accuracy and performance. Unlike KB-level settings, which apply changes broadly, this targeted method adapts configurations to the unique needs of each data source or element, streamlining your authoring process.

Fine-tuning at the data source level

Navigate to the desired data source.
Select the Advanced Settings tab.
Modify advanced parameters as needed and save the settings.

Fine-tuning at the node or leaf level

In the tree explorer, select the desired node or leaf.
On the right side, select the Advanced Settings tab.
Modify advanced parameters as needed and save the settings.

Reset advanced settings

NOTE: This feature is available in DRUID version 7.16 onwards.

To reset advanced configurations at the data source and node/leaf levels to match the KB Advanced settings, go to Knowledge Base > Advanced Settings and click the Save to All button. This action streamlines your settings management by applying consistent KB Advanced settings across your entire configuration with just one click.
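Conceptually, Save to All copies the KB Advanced settings over the settings stored at every data source and node/leaf. The sketch below models that with plain dictionaries; the function name and the data layout are hypothetical and do not reflect how DRUID stores settings.

```python
import copy


def save_to_all(kb_settings, data_sources):
    """Overwrite data source and node/leaf settings with a copy of the KB settings."""
    for data_source in data_sources:
        # Deep copies keep later edits at one level from leaking into another
        data_source["advanced_settings"] = copy.deepcopy(kb_settings)
        for node in data_source.get("nodes", []):
            node["advanced_settings"] = copy.deepcopy(kb_settings)


# Hypothetical layout: one data source with one node, both with their own settings
kb_settings = {"trainableColumns": "Question,Answer"}
data_sources = [
    {
        "advanced_settings": {"trainableColumns": None},
        "nodes": [{"advanced_settings": {"trainableColumns": None}}],
    }
]
save_to_all(kb_settings, data_sources)
```

After the call, every level holds the same values as the Knowledge Base, which is why any fine-tuning done at the data source or node/leaf level is lost.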

Enhance KB prediction

Refine your articles by transforming unstructured data into a question-and-answer format. Edit articles and add question / title / short description.

Access the Knowledge Base Advanced Settings, set the "trainableColumns" parameter to "Question,Answer", then train the Knowledge Base. The KB Engine will leverage both questions and answers from unstructured data sources during the prediction process, ultimately leading to improved prediction accuracy.

NOTE: For new bots created in DRUID 7.10 onwards, the engine will predict against both the question and answer by default. For existing bots, the engine will only predict against the answer ("trainableColumns": null) until you update the setting.
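The effect of the "trainableColumns" setting can be summarized in a few lines of Python. The function name is hypothetical; the two cases follow the description above (null means answer only, "Question,Answer" means both), and any other value is simply split on commas as an assumption.

```python
from typing import List, Optional


def trained_columns(trainable_columns: Optional[str]) -> List[str]:
    """Return the columns the engine predicts against for a given setting value."""
    if trainable_columns is None:
        # "trainableColumns": null -> predict against the answer only
        return ["Answer"]
    # Assumption: the value is a comma-separated list of column names
    return [name.strip() for name in trainable_columns.split(",") if name.strip()]


existing_bot = trained_columns(None)
updated_bot = trained_columns("Question,Answer")
```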

Authenticate with SharePoint by using a DRUID-generated certificate

To authenticate with SharePoint using a DRUID-generated certificate, follow these steps:

Step 1. Create certificate

Go to the Knowledge Base Advanced Settings and click on Authentication Certificates.

Click the Create Certificate button. The Create new certificate pop-up appears.

Enter a name for the certificate (you will use it to identify the certificate in the Druid Portal), select the certificate expiry date from the Ending at field and click the Create button.

The certificate appears in the Certificates list. Download it on your computer.

Click the Save&Close button.

Step 2. Import the certificate in Azure

To import the DRUID-generated certificate in Azure, follow these steps:

In the Azure Portal, go to App registrations > All applications > SharePoint File Discovery.
From the left menu, click Manage and select Certificates & secrets.
On the page, click the Certificates tab, then click Upload certificate.

Browse for the certificate you downloaded from the Druid Portal and select it.

Once the certificate is successfully uploaded, you can create the SharePoint data source.

Step 3. Create the SharePoint Data Source

Create the SharePoint data source following the procedure described in the Create the data source step, with the following specific settings: select Use Certificate, then select the desired certificate from the Certificate field.

Authenticate with SharePoint when the DRUID-generated certificate has expired

If you have a SharePoint data source that uses a DRUID-generated certificate for authentication and the certificate has expired, follow these steps:

Step 1. Delete the expired certificate and create a new one

Go to the Knowledge Base Advanced Settings and click on Authentication Certificates. In the Certificates list, click the delete icon inline with the expired certificate.

In the confirmation dialog, click Yes to confirm the certificate deletion.

You can now create a new certificate or use an existing valid one.

To create a new certificate, click the Create Certificate button. The Create new certificate pop-up appears.

Enter a name for the certificate (you will use it to identify the certificate in the Druid Portal), select the certificate expiry date from the Ending at field and click the Create button.

The certificate appears in the Certificates list. Download it on your computer.

Click the Save&Close button.

Step 2. Import the certificate in Azure

To import the DRUID-generated certificate in Azure, follow these steps:

In the Azure Portal, go to App registrations > All applications > SharePoint File Discovery.
From the left menu, click Manage and select Certificates & secrets.
On the page, click the Certificates tab, then click Upload certificate.

Browse for the certificate you downloaded from the Druid Portal and select it.

Step 3. Select the new certificate on the SharePoint data source

From the Knowledge Base page, click on the SharePoint data source with the expired authentication certificate and click the Details tab. Choose the new certificate from the Certificate field and click Save. To verify the credentials, click the Test button.